“Web scraping is the process of automatically mining data or collecting information from the World Wide Web.”
– Wikipedia
Web scraping is a flexible method to extract data from the internet. It can involve extracting numerical or text data.
There are many uses for web scraping, including but not limited to:
Good news! You can easily check whether a site allows scraping with the robotstxt package.
Netflix does not allow you to scrape their site.
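A minimal sketch of that check, assuming the robotstxt package is installed. `paths_allowed()` fetches a site's robots.txt and reports whether a path may be crawled. The helper name `can_scrape` is made up for this example, and the calls need an internet connection, so they are left commented out.

```r
# Hypothetical helper: wraps robotstxt::paths_allowed() so the check
# only runs (and only touches the network) when you call it.
can_scrape <- function(url) {
  if (!requireNamespace("robotstxt", quietly = TRUE)) {
    stop("Install the robotstxt package first: install.packages('robotstxt')")
  }
  # TRUE when the site's robots.txt permits crawling the given path
  robotstxt::paths_allowed(paths = url)
}

# Example calls (require internet access):
# can_scrape("https://www.netflix.com")
# can_scrape("https://www.amazon.com")
```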
“HTML is the standard markup language for creating Web pages.”
– W3Schools
“CSS describes how HTML elements are to be displayed on screen, paper, or in other media.”
– W3Schools
Image credit: Professor Shawn Santo
HTML is structured with “tags,” which mark portions of the page and can be selected by their place in that structure.
There are many types of tags - here are some important ones for scraping:
<h1> - header tags
<p> - paragraph elements
<ul> - unordered bulleted list
<ol> - ordered list
<li> - individual list item
<div> - division
<table> - table
If you aren’t familiar with CSS, extracting parts of a website can be daunting.
SelectorGadget is incredibly helpful. However, it is only available for Chrome.
Inspecting the page elements is also helpful. The Inspect tool is available among the developer tools of most major browsers.
CSS - syntax is easier and aligns with HTML tags
XPath - useful when the node isn’t uniquely identified with CSS
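To see the two selector styles side by side, here is a small sketch on a made-up page parsed from a string rather than a live site. The HTML and the class names (`book`, `title`) are invented for illustration; `html_nodes()` accepts either a CSS selector or an `xpath =` argument.

```r
library(rvest)

# A tiny made-up page, parsed from a string instead of a live site
page <- read_html('<html><body>
  <h1>R Books</h1>
  <div class="book"><p class="title">Book A</p><p>Paperback</p></div>
  <div class="book"><p class="title">Book B</p><p>Kindle</p></div>
  <ul><li>First</li><li>Second</li></ul>
</body></html>')

# CSS: select by tag name
heading <- page %>% html_nodes("h1") %>% html_text()

# CSS: select by tag plus class
css_titles <- page %>% html_nodes("p.title") %>% html_text()

# XPath: the second <p> inside each book <div>, which has no class to target
xpath_formats <- page %>%
  html_nodes(xpath = "//div[@class='book']/p[2]") %>%
  html_text()

css_titles
xpath_formats
```

The format paragraphs carry no class of their own, which is the kind of case where XPath's positional syntax helps.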
Set up the environment to scrape the site.
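The setup code itself is not shown in this text. A minimal sketch, assuming the packages whose functions appear elsewhere in this talk:

```r
# Packages this walkthrough appears to rely on (an assumption; the
# original setup chunk is not shown)
library(rvest)      # read_html(), html_nodes(), html_text(), and %>%
library(stringr)    # str_trim()
library(tibble)     # tibble()
library(robotstxt)  # paths_allowed()
```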
That’s it!
It seems only appropriate to pull data on R books from Amazon.
Ensure we can scrape the site.
We are good to scrape!
Before you get started, you must specify the URL.
Data as of 2020-07-06.
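The code that follows uses an object named `amazon`, which would hold the parsed page. A minimal sketch of how it might be created, assuming rvest's `read_html()`. The URL below is a hypothetical placeholder (the talk's actual URL is not shown) and `read_page` is a made-up helper name.

```r
library(rvest)

# Hypothetical search URL; substitute the page you actually want to scrape
url <- "https://www.amazon.com/s?k=R+books"

# read_html() downloads and parses the page. It is wrapped in a function
# so that nothing is requested until you call it.
read_page <- function(page_url) {
  read_html(page_url)
}

# Requires internet access:
# amazon <- read_page(url)
```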
titles <- amazon %>%
  html_nodes(".s-line-clamp-2") %>%
  html_text()
head(titles)
#> [1] "\n \n \n \n\n\n\n\n\n \n \n \n GANs in Action: Deep learning with Generative Adversarial Networks\n \n \n \n \n\n\n \n"
#> [2] "\n \n \n \n\n\n\n\n\n \n \n \n R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n \n \n \n \n\n\n \n"
#> [3] "\n \n \n \n\n\n\n\n\n \n \n \n The Book of R: A First Course in Programming and Statistics\n \n \n \n \n\n\n \n"
#> [4] "\n \n \n \n\n\n\n\n\n \n \n \n R Graphics Cookbook: Practical Recipes for Visualizing Data\n \n \n \n \n\n\n \n"
#> [5] "\n \n \n \n\n\n\n\n\n \n \n \n An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics)\n \n \n \n \n\n\n \n"
#> [6] "\n \n \n \n\n\n\n\n\n \n \n \n Statistical Inference via Data Science: A ModernDive into R and the Tidyverse (Chapman & Hall/CRC The R Series)\n \n \n \n \n\n\n \n"
The element pulls in a number of line breaks and blank spaces.
Let’s clean this up with str_trim.
titles <- str_trim(titles) # Removes leading & trailing whitespace, including line breaks (\n), from the titles
head(titles)
#> [1] "GANs in Action: Deep learning with Generative Adversarial Networks"
#> [2] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"
#> [3] "The Book of R: A First Course in Programming and Statistics"
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"
#> [5] "An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics)"
#> [6] "Statistical Inference via Data Science: A ModernDive into R and the Tidyverse (Chapman & Hall/CRC The R Series)"
This simple function returns cleaned text.
format <- amazon %>%
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>%
  html_text()
head(format)
#> [1] "\n \n \n \n Paperback\n \n \n"
#> [2] "\n \n \n \n Paperback\n \n \n"
#> [3] "\n \n \n \n Kindle\n \n \n"
#> [4] "\n \n \n \n Paperback\n \n \n"
#> [5] "\n \n \n \n eTextbook\n \n \n"
#> [6] "\n \n \n \n Paperback\n \n \n"
The price structure splits price into two elements. We must pull each and combine them into a single price.
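The price code is not shown in this text, so here is a minimal sketch of the idea on made-up markup. The class names `whole` and `fraction` and the price values are invented for illustration. The real page's selectors would need to be found with SelectorGadget or the browser inspector.

```r
library(rvest)

# Made-up markup: each price is split into a whole part and a fraction part
page <- read_html('<html><body>
  <span class="whole">40.</span><span class="fraction">10</span>
  <span class="whole">25.</span><span class="fraction">00</span>
</body></html>')

whole    <- page %>% html_nodes(".whole") %>% html_text()
fraction <- page %>% html_nodes(".fraction") %>% html_text()

# Keep digits only, then rejoin the two parts around a decimal point
whole    <- gsub("[^0-9]", "", whole)
fraction <- gsub("[^0-9]", "", fraction)
price    <- as.numeric(paste0(whole, ".", fraction))
price
```

Pairing the two vectors by position only works if every item has both parts, so it is worth checking that `whole` and `fraction` have the same length before combining.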
This element is messier and we’ll need a number of cleaning steps.
rate_n <- amazon %>%
  html_nodes("div.a-row.a-size-small") %>%
  html_text()
head(rate_n)
#> [1] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.1 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 9\n \n \n \n \n\n\n\n"
#> [2] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 427\n \n \n \n \n\n\n\n"
#> [3] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 76\n \n \n \n \n\n\n\n"
#> [4] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14\n \n \n \n \n\n\n\n"
#> [5] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 551\n \n \n \n \n\n\n\n"
#> [6] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 5.0 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 3\n \n \n \n \n\n\n\n"
rate_n <- str_trim(rate_n) # trim \n & ' '
head(rate_n)
#> [1] "4.1 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 9"
#> [2] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 427"
#> [3] "4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 76"
#> [4] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14"
#> [5] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 551"
#> [6] "5.0 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 3"
Let’s assemble the file!
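Before assembling, note that each trimmed rate_n string still holds two values: the star rating at the start and the number of ratings at the end. The remaining cleaning steps are not shown in this text. One possible sketch, assuming stringr and using two strings in the same shape as the output above (inner whitespace shortened):

```r
library(stringr)

# Two trimmed strings in the shape shown above (inner whitespace shortened)
rate_raw <- c("4.1 out of 5 stars\n \n \n 9",
              "4.7 out of 5 stars\n \n \n 427")

# The star rating is the number at the start of the string
rating <- as.numeric(str_extract(rate_raw, "^[0-9.]+"))

# The rating count is the run of digits (and any thousands separators)
# at the end of the string
rate_n <- as.numeric(str_remove_all(str_extract(rate_raw, "[0-9,]+$"), ","))

rating
rate_n
```

A book with no rating would give `NA` in both vectors, which keeps the missing value visible instead of silently dropping the record.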
One issue with scraping is that you sometimes get an uneven number of records because some data elements are missing.
We can fix this!
All titles were populated and scraped accurately. However, due to multiple formats, these records must be repeated to fill the dataframe.
Some titles have only 1 format.
Some titles have more than 2 formats.
Nothing needed here!
Or here!
Some books don’t have ratings.
A book only has one rating even if it has multiple formats.
We must also account for multiple formats.
Like titles, the ratings need to be repeated to show on the correct row.
The same corrections are done here.
Some books have only 1 format.
Some books have more than 2 formats.
Not all titles have a rating.
Create extra rows due to multiple book formats.
Some books have only 1 format.
Some books have more than 2 formats.
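The correction code is not shown in this text. As a rough sketch of the repeat-to-fill idea in base R, with made-up data: `n_formats` (the number of formats per title) is assumed to be known already, and how it is derived from the page is outside this example.

```r
# Made-up example: three scraped titles, but five format rows in total
titles    <- c("Book A", "Book B", "Book C")
n_formats <- c(2, 1, 2)   # formats found per title (assumed known)
format    <- c("Paperback", "Kindle", "Paperback", "Paperback", "eTextbook")

# One rating per title; NA marks a book with no rating
rating <- c(4.7, NA, 4.3)

# Repeat each title and rating once per format so every column has 5 values
titles_long <- rep(titles, times = n_formats)
rating_long <- rep(rating, times = n_formats)

# Every column must have the same length before building the data frame
stopifnot(length(titles_long) == length(format))
books <- data.frame(title = titles_long, text_format = format, rating = rating_long)
books
```

Using `NA` for the missing rating keeps the rows aligned, so each format still lands next to the right title.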
r_books <- tibble(title = titles,
                  text_format = format,
                  price = price,
                  rating = rating,
                  num_ratings = rate_n,
                  publication_date = pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>
#> 1 R for Data Science: Imp~ Paperback    40.1    4.7         427 2017-01-10
#> 2 R for Data Science: Imp~ Kindle       25.0    4.7         427 2017-01-10
#> 3 The Book of R: A First ~ Paperback    33.0    4.3          76 2016-07-16
#> 4 The Book of R: A First ~ eTextbook    30.0    4.3          76 2016-07-16
#> 5 Discovering Statistics ~ Paperback    34.5    4.5         255 2012-04-05
#> 6 Discovering Statistics ~ Kindle       61.6    4.5         255 2012-04-05
Web Scraping in R & rvest repo
This talk is freely distributed under the MIT License.